online recommendation
A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation
Reinforcement learning is effective in optimizing policies for recommender systems. Current solutions mostly focus on model-free approaches, which require frequent interactions with a real environment and are thus expensive in model learning. Offline evaluation methods, such as importance sampling, can alleviate such limitations, but usually require a large amount of logged data and do not work well when the action space is large. In this work, we propose a model-based reinforcement learning solution which models the user-agent interaction for offline policy learning via a generative adversarial network. To reduce bias in the learnt policy, we use the discriminator to evaluate the quality of generated sequences and rescale the generated rewards. Our theoretical analysis and empirical evaluations demonstrate the effectiveness of our solution in identifying patterns from given offline data and learning policies based on the offline and generated data.
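The reward-rescaling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a sigmoid over a discriminator logit, and the multiplicative weighting are all assumptions made for illustration. The only idea taken from the abstract is that a discriminator's judgment of a generated sequence is used to rescale that sequence's rewards.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw discriminator logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rescale_rewards(rewards, disc_logit):
    """Scale generated rewards by the discriminator's belief that the
    sequence looks real (hypothetical weighting, assumed for this sketch)."""
    weight = sigmoid(disc_logit)
    return [weight * r for r in rewards]

def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma per step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same generated click sequence, judged realistic vs. unrealistic.
generated = [1.0, 0.0, 1.0]
confident = rescale_rewards(generated, disc_logit=2.0)
doubtful = rescale_rewards(generated, disc_logit=-2.0)
```

Under this weighting, a sequence the discriminator finds implausible contributes less return to policy learning than the same sequence judged plausible, which is one way generated data could be kept from biasing the policy.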
Reviews: Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation
Originality: The proposed approach is a novel combination of well-known techniques such as RL and GAN for recommendation. Related work has been adequately cited. It is clear how the proposed approach differs from the existing literature. Quality: The approach appears to be technically sound. The theoretical analysis and the experiments support the claims.
Reviews: Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation
The reviewers overall felt positively about this paper, though scores were somewhat marginal. The reviews, while not glowing, lean toward acceptance of the paper: the reviewers feel the work is technically sound, the method is practical and effective, related work is good, and the experiments are convincing. There are some more tentative comments regarding novelty/originality, since the work is mostly a combination of existing techniques (R3). However, this issue does not seem to be a dealbreaker, and the difference compared to existing work is clear. A few clarifying points seem to be addressed in the rebuttal.
Dynamic Surrogate Switching: Sample-Efficient Search for Factorization Machine Configurations in Online Recommendations
Škrlj, Blaž, Schwartz, Adi, Ferlež, Jure, Kopič, Davorin, Ziporin, Naama
Hyperparameter optimization is the process of identifying the appropriate hyperparameter configuration of a given machine learning model with regard to a given learning task. For smaller data sets, an exhaustive search is possible; however, when the data size and model complexity increase, the number of configuration evaluations becomes the main computational bottleneck. A promising paradigm for tackling this type of problem is surrogate-based optimization. The main idea underlying this paradigm is an incrementally updated model of the relation between the hyperparameter space and the output (target) space; the data for this model are obtained by evaluating the main learning engine, which is, for example, a factorization machine-based model. By learning to approximate the hyperparameter-target relation, the surrogate (machine learning) model can be used to score large numbers of hyperparameter configurations, exploring parts of the configuration space beyond the reach of direct machine learning engine evaluation. Commonly, a surrogate is selected prior to optimization initialization and remains the same during the search. We investigated whether dynamic switching of surrogates during the optimization itself is a sensible idea of practical relevance for selecting the most appropriate factorization machine-based models for large-scale online recommendation. We conducted benchmarks on data sets containing hundreds of millions of instances against established baselines such as Random Forest- and Gaussian process-based surrogates. The results indicate that surrogate switching can offer good performance while considering fewer learning engine evaluations.
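The surrogate loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: `engine_loss` is a hypothetical stand-in for an expensive learning-engine evaluation, the two nearest-neighbour surrogates replace the Random Forest and Gaussian process models mentioned in the abstract, and the switching rule (lowest leave-one-out error) is an assumption chosen for simplicity.

```python
import random

def engine_loss(cfg):
    """Hypothetical stand-in for one expensive learning-engine evaluation."""
    lr, reg = cfg
    return (lr - 0.3) ** 2 + (reg - 0.7) ** 2

def knn_predict(k, history, cfg):
    """Surrogate: mean loss of the k evaluated configurations nearest to cfg."""
    dist = lambda h: sum((a - b) ** 2 for a, b in zip(h[0], cfg))
    nearest = sorted(history, key=dist)[:k]
    return sum(loss for _, loss in nearest) / len(nearest)

# Two candidate surrogates (assumed for this sketch).
SURROGATES = {
    "1-nn": lambda history, cfg: knn_predict(1, history, cfg),
    "3-nn": lambda history, cfg: knn_predict(3, history, cfg),
}

def loo_error(predict, history):
    """Leave-one-out mean absolute error of a surrogate on evaluated points."""
    total = 0.0
    for i, (cfg, loss) in enumerate(history):
        rest = history[:i] + history[i + 1:]
        total += abs(predict(rest, cfg) - loss)
    return total / len(history)

def search(n_rounds=15, n_candidates=200, n_init=4, seed=0):
    rng = random.Random(seed)
    sample = lambda: (rng.random(), rng.random())
    # Seed the surrogate's training data with a few real evaluations.
    history = [(cfg, engine_loss(cfg)) for cfg in (sample() for _ in range(n_init))]
    chosen = []
    for _ in range(n_rounds):
        # Dynamic switching: use whichever surrogate currently fits best.
        name = min(SURROGATES, key=lambda n: loo_error(SURROGATES[n], history))
        chosen.append(name)
        # Score many configurations cheaply, then evaluate only the best one.
        candidates = [sample() for _ in range(n_candidates)]
        best = min(candidates, key=lambda c: SURROGATES[name](history, c))
        history.append((best, engine_loss(best)))
    best_cfg, best_loss = min(history, key=lambda h: h[1])
    return {"best_cfg": best_cfg, "best_loss": best_loss,
            "chosen": chosen, "n_evaluations": len(history)}

result = search()
```

Each round scores 200 configurations with the surrogate but calls the engine only once, so the 15 rounds cost 19 engine evaluations in total, including the 4 used for initialization.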
Technique protects privacy when making online recommendations
Algorithms recommend products while we shop online or suggest songs we might like as we listen to music on streaming apps. These algorithms work by using personal information like our past purchases and browsing history to generate tailored recommendations. The sensitive nature of such data makes preserving privacy extremely important, but existing methods for solving this problem rely on heavy cryptographic tools requiring enormous amounts of computation and bandwidth. MIT researchers may have a better solution. They developed a privacy-preserving protocol that is so efficient it can run on a smartphone over a very slow network.
Achieving Counterfactual Fairness for Causal Bandit
Huang, Wen, Zhang, Lu, Wu, Xintao
In online recommendation, customers arrive in a sequential and stochastic manner from an underlying distribution and the online decision model recommends a chosen item for each arriving individual based on some strategy. We study how to recommend an item at each step to maximize the expected reward while achieving user-side fairness for customers, i.e., customers who share similar profiles will receive a similar reward regardless of their sensitive attributes and the items being recommended. By incorporating causal inference into bandits and adopting soft intervention to model the arm selection strategy, we first propose the d-separation based UCB algorithm (D-UCB) to explore the utilization of the d-separation set in reducing the amount of exploration needed to achieve low cumulative regret. Based on that, we then propose the fair causal bandit (F-UCB) for achieving counterfactual individual fairness. Both theoretical analysis and empirical evaluation demonstrate the effectiveness of our algorithms.
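For readers unfamiliar with the UCB family that D-UCB and F-UCB build on, the following is a minimal sketch of the standard UCB1 rule on Bernoulli arms. It is not D-UCB or F-UCB: it uses no causal graph, no d-separation set, and no fairness constraint. The arm means, horizon, and function name are illustrative assumptions.

```python
import math
import random

def ucb1(arm_means, horizon=2000, seed=0):
    """Run standard UCB1 on Bernoulli arms and return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of times each arm was recommended
    sums = [0.0] * n_arms   # total reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using the index
        else:
            # Empirical mean plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Three hypothetical items with click probabilities 0.2, 0.5, and 0.8.
counts = ucb1([0.2, 0.5, 0.8])
```

The exploration bonus is what drives cumulative regret: every suboptimal arm must be pulled often enough to rule it out. The abstract's D-UCB aims to reduce that exploration by exploiting the d-separation set.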
Generative Inverse Deep Reinforcement Learning for Online Recommendation
Chen, Xiaocong, Yao, Lina, Sun, Aixin, Wang, Xianzhi, Xu, Xiwei, Zhu, Liming
Deep reinforcement learning enables an agent to capture users' interests dynamically through interactions with the environment. It has attracted great interest in recommendation research. Deep reinforcement learning uses a reward function to learn users' interests and to control the learning process. However, most reward functions are manually designed; they are either too unrealistic or too imprecise to reflect the high variety, dimensionality, and non-linearity of the recommendation problem. That makes it difficult for the agent to learn an optimal policy to generate the most satisfactory recommendations. To address the above issue, we propose a novel generative inverse reinforcement learning approach, namely InvRec, which extracts the reward function from users' behaviors automatically, for online recommendation. We conduct experiments on an online platform, VirtualTB, and compare with several state-of-the-art methods to demonstrate the feasibility and effectiveness of our proposed approach.
Retailers Use AI to Improve Online Recommendations for Shoppers
Wayfair Inc. has 37,173 kinds of coffee mugs for sale. Factor in different colors, sizes or materials, and the range of options rises above 70,000. It's Jim Miller's job to help shoppers find the mug they want--along with a set of espresso cups, a waffle iron or other products they didn't know they wanted. In the past five years, the company's success rate has jumped 50%, measured by the number of clicks it takes for a customer to add an item to their carts and how often they buy those items, among other variables, according to Mr. Miller, the Boston-based online retailer's chief technology officer. He credits the gains to advances in smart software.